The second data product for my university course. Its aim is to predict housing prices using regression.
The data used is the Ames Housing dataset from kaggle.com. It’s split into two files: train.csv and test.csv.
Created by: Dobosi Péter MW79ON
First, let’s decide what exactly the problem is and what we want to achieve. The full overview of the problem can be read here: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview
In summary, we have to create a regression model that can predict the sale price of a house from its parameters.
My plan is the following:
First things first, let’s load the data and take a quick look at it:
data <- read.csv("./train.csv")
data
We have 1460 observations and 81 columns: 79 descriptive parameters plus Id and the target, SalePrice. There are NAs, and every column is either a character or an integer vector.
To properly prepare the data, we have to handle the missing values one way or another and convert the character columns to factors.
It seems like we have a couple of parameters that are mostly NAs:
alley_na_percentage <- sum(is.na(data$Alley))/nrow(data)*100
pool_na_percentage <- sum(is.na(data$PoolQC))/nrow(data)*100
fence_na_percentage <- sum(is.na(data$Fence))/nrow(data)*100
feature_na_percentage <- sum(is.na(data$MiscFeature))/nrow(data)*100
paste("Percentage of NAs in the field Alley:", alley_na_percentage, "%")
[1] "Percentage of NAs in the field Alley: 93.7671232876712 %"
paste("Percentage of NAs in the field PoolQC:", pool_na_percentage, "%")
[1] "Percentage of NAs in the field PoolQC: 99.5205479452055 %"
paste("Percentage of NAs in the field Fence:", fence_na_percentage, "%")
[1] "Percentage of NAs in the field Fence: 80.7534246575342 %"
paste("Percentage of NAs in the field MiscFeature:", feature_na_percentage, "%")
[1] "Percentage of NAs in the field MiscFeature: 96.3013698630137 %"
With the help of data_description.txt we can decipher what these mean:
(Only the relevant parts here, read the rest from the file if you are interested.)
Alley:
Type of alley access to property.
Grvl Gravel
Pave Paved
NA No alley access
PoolQC:
Pool quality.
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
NA No Pool
Fence:
Fence quality.
GdPrv Good Privacy
MnPrv Minimum Privacy
GdWo Good Wood
MnWw Minimum Wood/Wire
NA No Fence
MiscFeature:
Miscellaneous feature not covered in other categories.
Elev Elevator
Gar2 2nd Garage (if not described in garage section)
Othr Other
Shed Shed (over 100 SF)
TenC Tennis Court
NA None
As we can see, these NAs don’t mean that we have no information on those parameters of the buildings. Rather, they mean that the property simply lacks the thing described by the parameter. This is important information: we can’t just drop these values, or guess them based on the others.
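One way to double-check this reading is to compare a quality column with its matching size column: if NA in PoolQC really means “no pool”, it should coincide with PoolArea being zero. The sketch below uses a small made-up data frame (toy) instead of the real train.csv, so the numbers are only illustrative.

```r
# Toy stand-in for the real data; the values are made up for illustration.
toy <- data.frame(
  PoolArea = c(0, 0, 512, 0, 648),
  PoolQC   = c(NA, NA, "Ex", NA, "Gd"),
  stringsAsFactors = FALSE
)

# Cross-tabulate "quality is NA" against "area is zero".
# If NA really means "no pool", every row lands on the diagonal.
check <- table(qc_missing = is.na(toy$PoolQC), no_pool = toy$PoolArea == 0)
print(check)

# TRUE when the two indicators agree for every row.
consistent <- all(is.na(toy$PoolQC) == (toy$PoolArea == 0))
```

The same comparison could be tried on the real data with the other pairs, for example FireplaceQu and Fireplaces.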
Let’s find out more about the data’s characteristics!
Let’s take a quick look at the parameters of the dataset:
names(data)
[1] "Id" "MSSubClass" "MSZoning" "LotFrontage" "LotArea" "Street" "Alley"
[8] "LotShape" "LandContour" "Utilities" "LotConfig" "LandSlope" "Neighborhood" "Condition1"
[15] "Condition2" "BldgType" "HouseStyle" "OverallQual" "OverallCond" "YearBuilt" "YearRemodAdd"
[22] "RoofStyle" "RoofMatl" "Exterior1st" "Exterior2nd" "MasVnrType" "MasVnrArea" "ExterQual"
[29] "ExterCond" "Foundation" "BsmtQual" "BsmtCond" "BsmtExposure" "BsmtFinType1" "BsmtFinSF1"
[36] "BsmtFinType2" "BsmtFinSF2" "BsmtUnfSF" "TotalBsmtSF" "Heating" "HeatingQC" "CentralAir"
[43] "Electrical" "X1stFlrSF" "X2ndFlrSF" "LowQualFinSF" "GrLivArea" "BsmtFullBath" "BsmtHalfBath"
[50] "FullBath" "HalfBath" "BedroomAbvGr" "KitchenAbvGr" "KitchenQual" "TotRmsAbvGrd" "Functional"
[57] "Fireplaces" "FireplaceQu" "GarageType" "GarageYrBlt" "GarageFinish" "GarageCars" "GarageArea"
[64] "GarageQual" "GarageCond" "PavedDrive" "WoodDeckSF" "OpenPorchSF" "EnclosedPorch" "X3SsnPorch"
[71] "ScreenPorch" "PoolArea" "PoolQC" "Fence" "MiscFeature" "MiscVal" "MoSold"
[78] "YrSold" "SaleType" "SaleCondition" "SalePrice"
As we can see, there are 81 columns in total.
Let’s take a look at the types of the parameters:
str(data)
'data.frame': 1460 obs. of 81 variables:
$ Id : int 1 2 3 4 5 6 7 8 9 10 ...
$ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
$ MSZoning : chr "RL" "RL" "RL" "RL" ...
$ LotFrontage : int 65 80 68 60 84 85 75 NA 51 50 ...
$ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
$ Street : chr "Pave" "Pave" "Pave" "Pave" ...
$ Alley : chr NA NA NA NA ...
$ LotShape : chr "Reg" "Reg" "IR1" "IR1" ...
$ LandContour : chr "Lvl" "Lvl" "Lvl" "Lvl" ...
$ Utilities : chr "AllPub" "AllPub" "AllPub" "AllPub" ...
$ LotConfig : chr "Inside" "FR2" "Inside" "Corner" ...
$ LandSlope : chr "Gtl" "Gtl" "Gtl" "Gtl" ...
$ Neighborhood : chr "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
$ Condition1 : chr "Norm" "Feedr" "Norm" "Norm" ...
$ Condition2 : chr "Norm" "Norm" "Norm" "Norm" ...
$ BldgType : chr "1Fam" "1Fam" "1Fam" "1Fam" ...
$ HouseStyle : chr "2Story" "1Story" "2Story" "2Story" ...
$ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
$ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
$ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
$ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
$ RoofStyle : chr "Gable" "Gable" "Gable" "Gable" ...
$ RoofMatl : chr "CompShg" "CompShg" "CompShg" "CompShg" ...
$ Exterior1st : chr "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
$ Exterior2nd : chr "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
$ MasVnrType : chr "BrkFace" "None" "BrkFace" "None" ...
$ MasVnrArea : int 196 0 162 0 350 0 186 240 0 0 ...
$ ExterQual : chr "Gd" "TA" "Gd" "TA" ...
$ ExterCond : chr "TA" "TA" "TA" "TA" ...
$ Foundation : chr "PConc" "CBlock" "PConc" "BrkTil" ...
$ BsmtQual : chr "Gd" "Gd" "Gd" "TA" ...
$ BsmtCond : chr "TA" "TA" "TA" "Gd" ...
$ BsmtExposure : chr "No" "Gd" "Mn" "No" ...
$ BsmtFinType1 : chr "GLQ" "ALQ" "GLQ" "ALQ" ...
$ BsmtFinSF1 : int 706 978 486 216 655 732 1369 859 0 851 ...
$ BsmtFinType2 : chr "Unf" "Unf" "Unf" "Unf" ...
$ BsmtFinSF2 : int 0 0 0 0 0 0 0 32 0 0 ...
$ BsmtUnfSF : int 150 284 434 540 490 64 317 216 952 140 ...
$ TotalBsmtSF : int 856 1262 920 756 1145 796 1686 1107 952 991 ...
$ Heating : chr "GasA" "GasA" "GasA" "GasA" ...
$ HeatingQC : chr "Ex" "Ex" "Ex" "Gd" ...
$ CentralAir : chr "Y" "Y" "Y" "Y" ...
$ Electrical : chr "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
$ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
$ X2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
$ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
$ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
$ BsmtFullBath : int 1 0 1 1 1 1 1 1 0 1 ...
$ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
$ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
$ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
$ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
$ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
$ KitchenQual : chr "Gd" "TA" "Gd" "Gd" ...
$ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
$ Functional : chr "Typ" "Typ" "Typ" "Typ" ...
$ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
$ FireplaceQu : chr NA "TA" "TA" "Gd" ...
$ GarageType : chr "Attchd" "Attchd" "Attchd" "Detchd" ...
$ GarageYrBlt : int 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
$ GarageFinish : chr "RFn" "RFn" "RFn" "Unf" ...
$ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
$ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
$ GarageQual : chr "TA" "TA" "TA" "TA" ...
$ GarageCond : chr "TA" "TA" "TA" "TA" ...
$ PavedDrive : chr "Y" "Y" "Y" "Y" ...
$ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
$ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
$ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
$ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
$ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
$ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
$ PoolQC : chr NA NA NA NA ...
$ Fence : chr NA NA NA NA ...
$ MiscFeature : chr NA NA NA NA ...
$ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
$ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
$ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
$ SaleType : chr "WD" "WD" "WD" "WD" ...
$ SaleCondition: chr "Normal" "Normal" "Normal" "Abnorml" ...
$ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
All of our parameters are either character or integer vectors.
Let’s take a look at the summary:
summary(data)
Id MSSubClass MSZoning LotFrontage LotArea Street
Min. : 1.0 Min. : 20.0 Length:1460 Min. : 21.00 Min. : 1300 Length:1460
1st Qu.: 365.8 1st Qu.: 20.0 Class :character 1st Qu.: 59.00 1st Qu.: 7554 Class :character
Median : 730.5 Median : 50.0 Mode :character Median : 69.00 Median : 9478 Mode :character
Mean : 730.5 Mean : 56.9 Mean : 70.05 Mean : 10517
3rd Qu.:1095.2 3rd Qu.: 70.0 3rd Qu.: 80.00 3rd Qu.: 11602
Max. :1460.0 Max. :190.0 Max. :313.00 Max. :215245
NA's :259
Alley LotShape LandContour Utilities LotConfig LandSlope
Length:1460 Length:1460 Length:1460 Length:1460 Length:1460 Length:1460
Class :character Class :character Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character
Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual
Length:1460 Length:1460 Length:1460 Length:1460 Length:1460 Min. : 1.000
Class :character Class :character Class :character Class :character Class :character 1st Qu.: 5.000
Mode :character Mode :character Mode :character Mode :character Mode :character Median : 6.000
Mean : 6.099
3rd Qu.: 7.000
Max. :10.000
OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st
Min. :1.000 Min. :1872 Min. :1950 Length:1460 Length:1460 Length:1460
1st Qu.:5.000 1st Qu.:1954 1st Qu.:1967 Class :character Class :character Class :character
Median :5.000 Median :1973 Median :1994 Mode :character Mode :character Mode :character
Mean :5.575 Mean :1971 Mean :1985
3rd Qu.:6.000 3rd Qu.:2000 3rd Qu.:2004
Max. :9.000 Max. :2010 Max. :2010
Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
Length:1460 Length:1460 Min. : 0.0 Length:1460 Length:1460 Length:1460
Class :character Class :character 1st Qu.: 0.0 Class :character Class :character Class :character
Mode :character Mode :character Median : 0.0 Mode :character Mode :character Mode :character
Mean : 103.7
3rd Qu.: 166.0
Max. :1600.0
NA's :8
BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
Length:1460 Length:1460 Length:1460 Length:1460 Min. : 0.0 Length:1460
Class :character Class :character Class :character Class :character 1st Qu.: 0.0 Class :character
Mode :character Mode :character Mode :character Mode :character Median : 383.5 Mode :character
Mean : 443.6
3rd Qu.: 712.2
Max. :5644.0
BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir
Min. : 0.00 Min. : 0.0 Min. : 0.0 Length:1460 Length:1460 Length:1460
1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8 Class :character Class :character Class :character
Median : 0.00 Median : 477.5 Median : 991.5 Mode :character Mode :character Mode :character
Mean : 46.55 Mean : 567.2 Mean :1057.4
3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2
Max. :1474.00 Max. :2336.0 Max. :6110.0
Electrical X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
Length:1460 Min. : 334 Min. : 0 Min. : 0.000 Min. : 334 Min. :0.0000
Class :character 1st Qu.: 882 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130 1st Qu.:0.0000
Mode :character Median :1087 Median : 0 Median : 0.000 Median :1464 Median :0.0000
Mean :1163 Mean : 347 Mean : 5.845 Mean :1515 Mean :0.4253
3rd Qu.:1391 3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777 3rd Qu.:1.0000
Max. :4692 Max. :2065 Max. :572.000 Max. :5642 Max. :3.0000
BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual
Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.000 Min. :0.000 Length:1460
1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:1.000 Class :character
Median :0.00000 Median :2.000 Median :0.0000 Median :3.000 Median :1.000 Mode :character
Mean :0.05753 Mean :1.565 Mean :0.3829 Mean :2.866 Mean :1.047
3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:1.000
Max. :2.00000 Max. :3.000 Max. :2.0000 Max. :8.000 Max. :3.000
TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt
Min. : 2.000 Length:1460 Min. :0.000 Length:1460 Length:1460 Min. :1900
1st Qu.: 5.000 Class :character 1st Qu.:0.000 Class :character Class :character 1st Qu.:1961
Median : 6.000 Mode :character Median :1.000 Mode :character Mode :character Median :1980
Mean : 6.518 Mean :0.613 Mean :1979
3rd Qu.: 7.000 3rd Qu.:1.000 3rd Qu.:2002
Max. :14.000 Max. :3.000 Max. :2010
NA's :81
GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive
Length:1460 Min. :0.000 Min. : 0.0 Length:1460 Length:1460 Length:1460
Class :character 1st Qu.:1.000 1st Qu.: 334.5 Class :character Class :character Class :character
Mode :character Median :2.000 Median : 480.0 Mode :character Mode :character Mode :character
Mean :1.767 Mean : 473.0
3rd Qu.:2.000 3rd Qu.: 576.0
Max. :4.000 Max. :1418.0
WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea
Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000
Median : 0.00 Median : 25.00 Median : 0.00 Median : 0.00 Median : 0.00 Median : 0.000
Mean : 94.24 Mean : 46.66 Mean : 21.95 Mean : 3.41 Mean : 15.06 Mean : 2.759
3rd Qu.:168.00 3rd Qu.: 68.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
Max. :857.00 Max. :547.00 Max. :552.00 Max. :508.00 Max. :480.00 Max. :738.000
PoolQC Fence MiscFeature MiscVal MoSold YrSold
Length:1460 Length:1460 Length:1460 Min. : 0.00 Min. : 1.000 Min. :2006
Class :character Class :character Class :character 1st Qu.: 0.00 1st Qu.: 5.000 1st Qu.:2007
Mode :character Mode :character Mode :character Median : 0.00 Median : 6.000 Median :2008
Mean : 43.49 Mean : 6.322 Mean :2008
3rd Qu.: 0.00 3rd Qu.: 8.000 3rd Qu.:2009
Max. :15500.00 Max. :12.000 Max. :2010
SaleType SaleCondition SalePrice
Length:1460 Length:1460 Min. : 34900
Class :character Class :character 1st Qu.:129975
Mode :character Mode :character Median :163000
Mean :180921
3rd Qu.:214000
Max. :755000
A lot of our attributes are character vectors, which we can’t summarize this way.
Next, let’s find out more about some individual parameters of the data.
Let’s take a visual look:
plot(data$LotArea)
plot(data$LotArea, ylim = c(1000, 20000))
We can see that, in terms of area, most of the properties are between 1000 and 20000 square feet, with a few outliers.
Let’s check out how many houses were built in each year:
hist(data$YearBuilt)
We can also see a tendency towards newly built homes.
plot(table(data$Fireplaces))
grid()
Multidimensional examination:
Now let’s take a look at multiple parameters at the same time:
library(car)
scatterplot(data$YearBuilt, data$SalePrice, regLine = list(col="green"), smooth=list(col.smooth="red", col.spread="black"))
We might see some kind of exponential pattern, given the higher prices of newly built homes.
scatterplot(data$LotArea, data$SalePrice, xlim = c(1000, 20000), regLine = list(col="green"), smooth=list(col.smooth="red", col.spread="black"))
We can’t see any clear correlation between the area and the price of a property.
Just out of curiosity, I tried to draw the feature plot of the data. To my surprise, with a couple of tweaks, and given some time it actually worked:
# This took a couple minutes, but it worked. It's about 64k square meters.
# The plot is basically unreadable, but it shows that there is correlation
# between a couple of the parameters.
# pdf(file = "/home/peter/test.pdf",
# width = 10000,
# height = 10000)
# plot(data)
# dev.off()
We couldn’t really learn anything from it, though: due to its size, the plot is unreadable.
Let’s check out the covariance matrix:
# cov(data)
This doesn’t work, because our dataframe still contains values that are neither numeric nor logical. It’s time we cleaned up the data a bit. But before we do that, let’s take a look at the correlations between a few of the numeric parameters.
library(corrplot)
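# Columns: LotArea, OverallCond, YearBuilt, TotalBsmtSF, GrLivArea,
# GarageArea, PoolArea, MoSold, SalePrice
cors <- cor(data[,c(5,19,20,39,47,63,72,77,81)], use = "complete.obs")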
corrplot(cors, type = "lower")
We can see correlation between the price and a couple of parameters, such as the size of the living area.
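Coming back to the failed cov(data) call: it only fails because of the non-numeric columns. A minimal sketch of how it could be restricted to the numeric ones is shown below. The toy data frame is made up and only stands in for the real data.

```r
# Toy stand-in with one character column, like the real data has.
toy <- data.frame(
  LotArea   = c(8450, 9600, 11250, 9550),
  GrLivArea = c(1710, 1262, 1786, 1717),
  Street    = c("Pave", "Pave", "Grvl", "Pave"),
  stringsAsFactors = FALSE
)

# Keep only the numeric columns, then compute the covariance matrix.
# use = "complete.obs" drops rows that contain an NA in any kept column.
num_only <- toy[sapply(toy, is.numeric)]
cov_mat <- cov(num_only, use = "complete.obs")
print(cov_mat)
```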
There is a lot to do. We have NAs and non-numeric values everywhere. Let’s start by dealing with the NAs.
We need to find out which columns contain any NAs:
na_cols <- names(which(colSums(is.na(data)) > 0))
na_cols
[1] "LotFrontage" "Alley" "MasVnrType" "MasVnrArea" "BsmtQual" "BsmtCond" "BsmtExposure"
[8] "BsmtFinType1" "BsmtFinType2" "Electrical" "FireplaceQu" "GarageType" "GarageYrBlt" "GarageFinish"
[15] "GarageQual" "GarageCond" "PoolQC" "Fence" "MiscFeature"
We have 19 columns containing NAs. Let’s find out more about them.
Let’s find out how many NAs these columns have:
get_na_count <- function(column_name) {
sum(is.na(data[column_name]))
}
na_counts <- data.frame(sapply(na_cols, get_na_count))
library(data.table)
na_stats <- transpose(na_counts)
colnames(na_stats) <- na_cols
rownames(na_stats) <- c("NA count")
calc_na_percentage <- function(column_name) {
get_na_count(column_name = column_name)/nrow(data) * 100
}
na_stats[nrow(na_stats) + 1,] = sapply(na_cols, calc_na_percentage)
rownames(na_stats) <- c("NA count", "NA percentage")
na_stats
As we can see, we have some parameters that are mostly NAs, while others only contain a few of them.
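For comparison, a similar table can probably be built with base R alone, without data.table. This is only a sketch on a made-up toy data frame, and it puts one row per column instead of one column per parameter. The names na_count and na_summary are my own.

```r
# Toy stand-in for the real data; the values are made up for illustration.
toy <- data.frame(
  LotFrontage = c(65, NA, 68, NA),
  Alley       = c(NA, NA, NA, "Grvl"),
  LotArea     = c(8450, 9600, 11250, 9550),
  stringsAsFactors = FALSE
)

# Count the NAs in every column at once.
na_count <- colSums(is.na(toy))

# One row per column: the count and the percentage of NAs.
na_summary <- data.frame(
  na_count      = na_count,
  na_percentage = na_count / nrow(toy) * 100
)

# Keep only the columns that have NAs, most affected first.
na_summary <- na_summary[na_summary$na_count > 0, ]
na_summary <- na_summary[order(-na_summary$na_count), ]
print(na_summary)
```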
Let’s deal with them appropriately, now that we know more about them.
First, the LotFrontage parameter:
unique(data$LotFrontage)
[1] 65 80 68 60 84 85 75 NA 51 50 70 91 72 66 101 57 44 110 98 47 108 112 74 115 61 48 33
[28] 52 100 24 89 63 76 81 95 69 21 32 78 121 122 40 105 73 77 64 94 34 90 55 88 82 71 120
[55] 107 92 134 62 86 141 97 54 41 79 174 99 67 83 43 103 93 30 129 140 35 37 118 87 116 150 111
[82] 49 96 59 36 56 102 58 38 109 130 53 137 45 106 104 42 39 144 114 128 149 313 168 182 138 160 152
[109] 124 153 46
The description doesn’t say anything about NAs in this parameter, but as we can see, there aren’t any zeros here either. So I’ll assume that NA stands for zero frontage, in the same way that NAs mark the absence of a feature in most of the other parameters.
Let’s fill them in now:
data[is.na(data$LotFrontage),]$LotFrontage <- 0
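A small note on the assignment itself: indexing the column vector directly is a common alternative to assigning into a row subset of the data frame, and it stays safe when the chunk is run a second time. The sketch below uses a made-up toy data frame.

```r
# Toy stand-in for the real data; the values are made up for illustration.
toy <- data.frame(
  LotFrontage = c(65, NA, 68),
  LotArea     = c(8450, 9600, 11250)
)

# Index the column vector directly instead of a row subset of the data frame.
toy$LotFrontage[is.na(toy$LotFrontage)] <- 0

# Running the same line again is harmless: nothing is selected, nothing changes.
toy$LotFrontage[is.na(toy$LotFrontage)] <- 0

# The row-subset form can raise an error once no NA is left, because a value
# is then assigned into a data frame with zero rows. try() keeps the script going.
second_run <- try(toy[is.na(toy$LotFrontage), ]$LotFrontage <- 0, silent = TRUE)

print(toy$LotFrontage)
```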
The Alley parameter:
The data_description.txt says that NAs in this parameter mean that the given property has no alley access.
unique(data$Alley)
[1] NA "Grvl" "Pave"
Later I’ll probably convert all character vectors to factors, so let’s leave this as is.
Now for the Masonry veneer type:
ms_types <- unique(data$MasVnrType)
ms_types
[1] "BrkFace" "None" "Stone" "BrkCmn" NA
We have a handful of NAs, but here they don’t simply mean that the feature is absent (that case already has its own value, None). These are genuinely missing values, so we have to fill them in.
Let’s do so by the most frequent value:
get_ms_count <- function(unique_value){
sum(data$MasVnrType == unique_value, na.rm = T)
}
sapply(ms_types, get_ms_count)
BrkFace None Stone BrkCmn <NA>
445 864 128 15 0
As we can see, the most common option is None, so let’s assume that NAs are None:
data[is.na(data$MasVnrType),]$MasVnrType <- "None"
We have to do the same for Masonry veneer area as well, but with 0s this time:
data[is.na(data$MasVnrArea),]$MasVnrArea <- 0
BsmtQual is next:
unique(data$BsmtQual)
[1] "Gd" "TA" "Ex" NA "Fa"
According to the description, NAs here mean that the property has no basement. Let’s leave this as is.
The same is true for BsmtCond, BsmtExposure, BsmtFinType1 and BsmtFinType2.
Electrical is up next:
elec_types <- unique(data$Electrical)
elec_types
[1] "SBrkr" "FuseF" "FuseA" "FuseP" "Mix" NA
The description doesn’t say anything about the one missing value, so let’s fill it with the most frequent value:
# TODO I need to change these to reusable methods.
get_elec_count <- function(unique_value){
sum(data$Electrical == unique_value, na.rm = T)
}
sapply(elec_types, get_elec_count)
SBrkr FuseF FuseA FuseP Mix <NA>
1334 27 94 3 1 0
As we can see, the Standard Breaker is the most common, so let’s assume the missing value is that:
data[is.na(data$Electrical),]$Electrical <- "SBrkr"
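As the TODO above says, the counting and filling steps for MasVnrType and Electrical are nearly identical. One possible shape for a reusable function is sketched below. The name fill_with_mode is hypothetical, the toy data is made up, and the helper is only meant for character columns.

```r
# Hypothetical helper: replace the NAs in a character vector
# with the most frequent non-NA value.
fill_with_mode <- function(x) {
  counts <- table(x)                        # table() skips NAs by default
  most_common <- names(counts)[which.max(counts)]
  x[is.na(x)] <- most_common
  x
}

# Toy stand-in for the real data; the values are made up for illustration.
toy <- data.frame(
  Electrical = c("SBrkr", "FuseA", NA, "SBrkr"),
  MasVnrType = c("None", NA, "BrkFace", "None"),
  stringsAsFactors = FALSE
)

# Apply the helper to every column that should be filled this way.
cols <- c("Electrical", "MasVnrType")
toy[cols] <- lapply(toy[cols], fill_with_mode)
print(toy)
```

Only columns where NA means “unknown” should go through such a helper. The columns where NA means “not present” are deliberately left alone.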
FireplaceQu is next:
The description says that NAs here mean that there is no fireplace, so let’s leave this as is.
The same deal for all the parameters describing the garages.
PoolQC and Fence also behave the exact same way.
Finally the last one, MiscFeature. This one is similar, NAs simply mean that there aren’t any misc features.
Finally, after all this hard work, we shouldn’t have any NAs left in our dataframe where they don’t make sense. Let’s check whether that’s true:
names(which(colSums(is.na(data)) > 0))
[1] "Alley" "BsmtQual" "BsmtCond" "BsmtExposure" "BsmtFinType1" "BsmtFinType2" "FireplaceQu"
[8] "GarageType" "GarageYrBlt" "GarageFinish" "GarageQual" "GarageCond" "PoolQC" "Fence"
[15] "MiscFeature"
It is!
After we’ve dealt with all of the NAs, let’s check whether everything is the correct type:
str(data)
'data.frame': 1460 obs. of 81 variables:
$ Id : int 1 2 3 4 5 6 7 8 9 10 ...
$ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
$ MSZoning : chr "RL" "RL" "RL" "RL" ...
$ LotFrontage : num 65 80 68 60 84 85 75 0 51 50 ...
$ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
$ Street : chr "Pave" "Pave" "Pave" "Pave" ...
$ Alley : chr NA NA NA NA ...
$ LotShape : chr "Reg" "Reg" "IR1" "IR1" ...
$ LandContour : chr "Lvl" "Lvl" "Lvl" "Lvl" ...
$ Utilities : chr "AllPub" "AllPub" "AllPub" "AllPub" ...
$ LotConfig : chr "Inside" "FR2" "Inside" "Corner" ...
$ LandSlope : chr "Gtl" "Gtl" "Gtl" "Gtl" ...
$ Neighborhood : chr "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
$ Condition1 : chr "Norm" "Feedr" "Norm" "Norm" ...
$ Condition2 : chr "Norm" "Norm" "Norm" "Norm" ...
$ BldgType : chr "1Fam" "1Fam" "1Fam" "1Fam" ...
$ HouseStyle : chr "2Story" "1Story" "2Story" "2Story" ...
$ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
$ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
$ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
$ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
$ RoofStyle : chr "Gable" "Gable" "Gable" "Gable" ...
$ RoofMatl : chr "CompShg" "CompShg" "CompShg" "CompShg" ...
$ Exterior1st : chr "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
$ Exterior2nd : chr "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
$ MasVnrType : chr "BrkFace" "None" "BrkFace" "None" ...
$ MasVnrArea : num 196 0 162 0 350 0 186 240 0 0 ...
$ ExterQual : chr "Gd" "TA" "Gd" "TA" ...
$ ExterCond : chr "TA" "TA" "TA" "TA" ...
$ Foundation : chr "PConc" "CBlock" "PConc" "BrkTil" ...
$ BsmtQual : chr "Gd" "Gd" "Gd" "TA" ...
$ BsmtCond : chr "TA" "TA" "TA" "Gd" ...
$ BsmtExposure : chr "No" "Gd" "Mn" "No" ...
$ BsmtFinType1 : chr "GLQ" "ALQ" "GLQ" "ALQ" ...
$ BsmtFinSF1 : int 706 978 486 216 655 732 1369 859 0 851 ...
$ BsmtFinType2 : chr "Unf" "Unf" "Unf" "Unf" ...
$ BsmtFinSF2 : int 0 0 0 0 0 0 0 32 0 0 ...
$ BsmtUnfSF : int 150 284 434 540 490 64 317 216 952 140 ...
$ TotalBsmtSF : int 856 1262 920 756 1145 796 1686 1107 952 991 ...
$ Heating : chr "GasA" "GasA" "GasA" "GasA" ...
$ HeatingQC : chr "Ex" "Ex" "Ex" "Gd" ...
$ CentralAir : chr "Y" "Y" "Y" "Y" ...
$ Electrical : chr "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
$ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
$ X2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
$ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
$ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
$ BsmtFullBath : int 1 0 1 1 1 1 1 1 0 1 ...
$ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
$ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
$ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
$ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
$ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
$ KitchenQual : chr "Gd" "TA" "Gd" "Gd" ...
$ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
$ Functional : chr "Typ" "Typ" "Typ" "Typ" ...
$ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
$ FireplaceQu : chr NA "TA" "TA" "Gd" ...
$ GarageType : chr "Attchd" "Attchd" "Attchd" "Detchd" ...
$ GarageYrBlt : int 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
$ GarageFinish : chr "RFn" "RFn" "RFn" "Unf" ...
$ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
$ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
$ GarageQual : chr "TA" "TA" "TA" "TA" ...
$ GarageCond : chr "TA" "TA" "TA" "TA" ...
$ PavedDrive : chr "Y" "Y" "Y" "Y" ...
$ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
$ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
$ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
$ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
$ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
$ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
$ PoolQC : chr NA NA NA NA ...
$ Fence : chr NA NA NA NA ...
$ MiscFeature : chr NA NA NA NA ...
$ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
$ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
$ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
$ SaleType : chr "WD" "WD" "WD" "WD" ...
$ SaleCondition: chr "Normal" "Normal" "Normal" "Abnorml" ...
$ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
Nothing seems out of order, but we still have a bunch of character vectors. We need to encode them in a way that our models can use. Let’s convert them to factors, so that R can automatically dummy code them when building models.
First things first, we have to find out which parameters are strings, so we can know which ones to convert to factors:
char_parms <- colnames(data[sapply(data, is.character)])
char_parms
[1] "MSZoning" "Street" "Alley" "LotShape" "LandContour" "Utilities" "LotConfig"
[8] "LandSlope" "Neighborhood" "Condition1" "Condition2" "BldgType" "HouseStyle" "RoofStyle"
[15] "RoofMatl" "Exterior1st" "Exterior2nd" "MasVnrType" "ExterQual" "ExterCond" "Foundation"
[22] "BsmtQual" "BsmtCond" "BsmtExposure" "BsmtFinType1" "BsmtFinType2" "Heating" "HeatingQC"
[29] "CentralAir" "Electrical" "KitchenQual" "Functional" "FireplaceQu" "GarageType" "GarageFinish"
[36] "GarageQual" "GarageCond" "PavedDrive" "PoolQC" "Fence" "MiscFeature" "SaleType"
[43] "SaleCondition"
As we can see, we have 43 parameters which are characters. Let’s convert them to factors:
data[char_parms] <- lapply(data[char_parms], factor)
Let’s check whether we were successful:
str(data)
'data.frame': 1460 obs. of 81 variables:
$ Id : int 1 2 3 4 5 6 7 8 9 10 ...
$ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
$ MSZoning : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
$ LotFrontage : num 65 80 68 60 84 85 75 0 51 50 ...
$ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
$ Street : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
$ Alley : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
$ LotShape : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4 ...
$ LandContour : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
$ Utilities : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
$ LotConfig : Factor w/ 5 levels "Corner","CulDSac",..: 5 3 5 1 3 5 5 1 5 1 ...
$ LandSlope : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
$ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 6 7 14 12 21 17 18 4 ...
$ Condition1 : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 5 1 1 ...
$ Condition2 : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 1 ...
$ BldgType : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 2 ...
$ HouseStyle : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 6 1 3 6 1 2 ...
$ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
$ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
$ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
$ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
$ RoofStyle : Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 2 2 ...
$ RoofMatl : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Exterior1st : Factor w/ 15 levels "AsbShng","AsphShn",..: 13 9 13 14 13 13 13 7 4 9 ...
$ Exterior2nd : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 16 14 14 14 7 16 9 ...
$ MasVnrType : Factor w/ 4 levels "BrkCmn","BrkFace",..: 2 3 2 3 2 3 4 4 3 3 ...
$ MasVnrArea : num 196 0 162 0 350 0 186 240 0 0 ...
$ ExterQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 4 3 4 3 4 4 4 ...
$ ExterCond : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
$ Foundation : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 3 1 3 6 3 2 1 1 ...
$ BsmtQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 3 3 4 3 3 1 3 4 4 ...
$ BsmtCond : Factor w/ 4 levels "Fa","Gd","Po",..: 4 4 4 2 4 4 4 4 4 4 ...
$ BsmtExposure : Factor w/ 4 levels "Av","Gd","Mn",..: 4 2 3 4 1 4 1 3 4 4 ...
$ BsmtFinType1 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 3 1 3 1 3 3 3 1 6 3 ...
$ BsmtFinSF1 : int 706 978 486 216 655 732 1369 859 0 851 ...
$ BsmtFinType2 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 6 6 6 6 6 6 6 2 6 6 ...
$ BsmtFinSF2 : int 0 0 0 0 0 0 0 32 0 0 ...
$ BsmtUnfSF : int 150 284 434 540 490 64 317 216 952 140 ...
$ TotalBsmtSF : int 856 1262 920 756 1145 796 1686 1107 952 991 ...
$ Heating : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
$ HeatingQC : Factor w/ 5 levels "Ex","Fa","Gd",..: 1 1 1 3 1 1 1 1 3 1 ...
$ CentralAir : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
$ Electrical : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 2 5 ...
$ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
$ X2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
$ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
$ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
$ BsmtFullBath : int 1 0 1 1 1 1 1 1 0 1 ...
$ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
$ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
$ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
$ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
$ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
$ KitchenQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 3 3 4 3 4 4 4 ...
$ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
$ Functional : Factor w/ 7 levels "Maj1","Maj2",..: 7 7 7 7 7 7 7 7 3 7 ...
$ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
$ FireplaceQu : Factor w/ 5 levels "Ex","Fa","Gd",..: NA 5 5 3 5 NA 3 5 5 5 ...
$ GarageType : Factor w/ 6 levels "2Types","Attchd",..: 2 2 2 6 2 2 2 2 6 2 ...
$ GarageYrBlt : int 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
$ GarageFinish : Factor w/ 3 levels "Fin","RFn","Unf": 2 2 2 3 2 3 2 2 3 2 ...
$ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
$ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
$ GarageQual : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 2 3 ...
$ GarageCond : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
$ PavedDrive : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 3 3 3 3 3 ...
$ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
$ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
$ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
$ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
$ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
$ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
$ PoolQC : Factor w/ 3 levels "Ex","Fa","Gd": NA NA NA NA NA NA NA NA NA NA ...
$ Fence : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA 3 NA NA NA NA ...
$ MiscFeature : Factor w/ 4 levels "Gar2","Othr",..: NA NA NA NA NA 3 NA 3 NA NA ...
$ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
$ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
$ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
$ SaleType : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
$ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 1 5 5 5 5 1 5 ...
$ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
We were!
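One thing worth keeping in mind: factor() does not turn NA into a level, so the “not present” NAs are still plain NAs (see Alley or PoolQC in the output above), and R’s default NA handling in model fitting drops rows that contain them. A possible way around this, which is my assumption and not something done in this notebook so far, is to give those NAs an explicit label before converting. The label "None" and the toy data below are made up for illustration.

```r
# Toy stand-in for the real data; the values are made up for illustration.
toy <- data.frame(
  Alley     = c(NA, "Grvl", "Pave", NA),
  SalePrice = c(208500, 140000, 129900, 181500),
  stringsAsFactors = FALSE
)

# Plain conversion: NA is not among the levels.
plain <- factor(toy$Alley)
print(levels(plain))

# Give the missing values an explicit label first, then convert.
alley_chr <- toy$Alley
alley_chr[is.na(alley_chr)] <- "None"
toy$Alley <- factor(alley_chr)

# model.matrix() shows the dummy coding R would use for a model formula.
mm <- model.matrix(SalePrice ~ Alley, data = toy)
print(mm)
```

With the explicit level, all four toy rows stay in the design matrix.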
Now that the data is cleaned, let’s check out the correlations to see which parameters we should pay more attention to.
First let’s see for the numeric values:
num_parms <- colnames(data[sapply(data, is.numeric)])
num_parms
[1] "Id" "MSSubClass" "LotFrontage" "LotArea" "OverallQual" "OverallCond" "YearBuilt"
[8] "YearRemodAdd" "MasVnrArea" "BsmtFinSF1" "BsmtFinSF2" "BsmtUnfSF" "TotalBsmtSF" "X1stFlrSF"
[15] "X2ndFlrSF" "LowQualFinSF" "GrLivArea" "BsmtFullBath" "BsmtHalfBath" "FullBath" "HalfBath"
[22] "BedroomAbvGr" "KitchenAbvGr" "TotRmsAbvGrd" "Fireplaces" "GarageYrBlt" "GarageCars" "GarageArea"
[29] "WoodDeckSF" "OpenPorchSF" "EnclosedPorch" "X3SsnPorch" "ScreenPorch" "PoolArea" "MiscVal"
[36] "MoSold" "YrSold" "SalePrice"
numcors <- cor(data[,num_parms], use = "complete.obs")
corrplot(numcors, type = "lower")
This is hard to read, but we can already see that we don’t need all of these parameters.
Let’s keep only the ones whose correlation with SalePrice is above 0.5:
relevant_names <- names(numcors[38,numcors[38,] > 0.5])
relcors <- cor(data[,relevant_names], use = "complete.obs")
corrplot(relcors, type = "lower")
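The row index 38 above only works while SalePrice stays the 38th numeric column. A small sketch of the same selection done by name, shown on a made-up data frame (the name `toy` and all of its values are assumptions for illustration, not the Ames data):

```r
# Toy stand-in for the numeric columns of the housing data (values are made up).
toy <- data.frame(
  OverallQual = c(5, 6, 7, 8, 4, 9, 6, 7),
  GrLivArea   = c(1000, 1300, 1700, 2200, 900, 2600, 1400, 1800),
  MoSold      = c(2, 5, 9, 2, 12, 10, 8, 11),
  SalePrice   = c(120000, 150000, 210000, 260000, 100000, 330000, 160000, 215000)
)

cors <- cor(toy, use = "complete.obs")

# Look up the SalePrice row by name instead of by position.
price_cors <- cors["SalePrice", ]

# Keep the columns whose correlation with SalePrice exceeds 0.5,
# strongest first (SalePrice itself is kept, with correlation 1).
relevant <- sort(price_cors[price_cors > 0.5], decreasing = TRUE)
print(names(relevant))
```

Sorting the result also makes it easier to see which parameters are the strongest candidates.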
Let’s look at a few of them in detail:
scatterplot(data$OverallQual, data$SalePrice, regLine = list(col="green"), smooth=list(col.smooth="red", col.spread="black"))
scatterplot(data$GrLivArea, data$SalePrice, xlim = c(250, 3000), ylim = c(0, 500000), regLine = list(col="green"), smooth=list(col.smooth="red", col.spread="black"))
scatterplot(data$GrLivArea, data$TotRmsAbvGrd, regLine = list(col="green"), smooth=list(col.smooth="red", col.spread="black"))
scatterplot(data$OverallQual, data$GrLivArea, regLine = list(col="green"), smooth=list(col.smooth="red", col.spread="black"))
pairs(data[,c("SalePrice", "GrLivArea", "TotRmsAbvGrd")])
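We can see a couple of obvious correlations between the parameters themselves that tell us nothing new about the price, such as GarageArea with GarageCars and GrLivArea with TotRmsAbvGrd: a bigger garage holds more cars, and a bigger living area has more rooms.
Let’s calculate a new parameter, the price per square foot: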
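Such pairs can also be listed programmatically instead of being read off the plot. A minimal sketch on made-up values; the threshold of 0.8 is an arbitrary assumption, and `toy` is a hypothetical stand-in for the real data:

```r
# Toy predictors (made-up values): GarageCars and GarageArea move together.
toy <- data.frame(
  GarageCars = c(1, 2, 2, 3, 1, 2, 3, 2),
  GarageArea = c(250, 480, 500, 800, 240, 460, 820, 510),
  YearBuilt  = c(1950, 2001, 1975, 1990, 2005, 1960, 1985, 1940)
)

cors <- cor(toy)

# Keep each pair once (upper triangle only) and drop the diagonal.
cors[lower.tri(cors, diag = TRUE)] <- NA
pairs_tbl <- as.data.frame(as.table(cors), stringsAsFactors = FALSE)
names(pairs_tbl) <- c("var1", "var2", "correlation")
pairs_tbl <- pairs_tbl[!is.na(pairs_tbl$correlation), ]

# Pairs whose absolute correlation is above the (arbitrary) threshold.
collinear <- pairs_tbl[abs(pairs_tbl$correlation) > 0.8, ]
print(collinear)
```

When two parameters are this strongly related, keeping only one of them in a linear model is usually enough.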
data$ppsqf <- data$SalePrice / data$GrLivArea
scatterplot(data$OverallQual, data$ppsqf, regLine = list(col="green"), smooth=list(col.smooth="red", col.spread="black"))
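One caveat for the modelling steps: ppsqf is derived from SalePrice, so it cannot be computed for houses whose price is unknown, such as the rows of test.csv. The sketch below uses four GrLivArea and SalePrice pairs taken from the str() output in this notebook (the data frame name `toy` is hypothetical) and shows that the price can be recovered exactly from ppsqf and GrLivArea:

```r
# Four rows copied from the str() output of the training partition.
toy <- data.frame(
  GrLivArea = c(1710, 1262, 1717, 2198),
  SalePrice = c(208500, 181500, 140000, 250000)
)

toy$ppsqf <- toy$SalePrice / toy$GrLivArea

# Multiplying back recovers the target, so a model that is given both
# ppsqf and GrLivArea has effectively been told the answer.
recovered <- toy$ppsqf * toy$GrLivArea
print(all.equal(recovered, toy$SalePrice))
```

This is why ppsqf is useful for exploring the data, but risky as an input of a model that predicts SalePrice.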
Let’s check the correlation between this new parameter and the old ones:
relevant_names2 <- c(relevant_names, "ppsqf")
relevant_names2
[1] "OverallQual" "YearBuilt" "YearRemodAdd" "TotalBsmtSF" "X1stFlrSF" "GrLivArea" "FullBath"
[8] "TotRmsAbvGrd" "GarageCars" "GarageArea" "SalePrice" "ppsqf"
relcors2 <- cor(data[,relevant_names2], use = "complete.obs")
corrplot(relcors2, type = "lower")
It seems that the price per square foot rises with the year of construction. Let’s check:
scatterplot(data$YearBuilt, data$ppsqf, regLine = list(col="green"), smooth=list(col.smooth="red", col.spread="black"))
We were right.
Let’s build some models. I would like to use the caret package to build a linear and an exponential regression model.
Let’s create the data partitions first:
library(caret)
target <- data$SalePrice
trainIdx <- createDataPartition(target, p = .75)
traindata <- data[trainIdx$Resample1,]
testdata <- data[-trainIdx$Resample1,]
str(traindata)
'data.frame': 1097 obs. of 82 variables:
$ Id : int 1 2 4 5 6 8 9 10 11 13 ...
$ MSSubClass : int 60 20 70 60 50 60 50 190 20 20 ...
$ MSZoning : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 5 4 4 4 ...
$ LotFrontage : num 65 80 60 84 85 0 51 50 70 0 ...
$ LotArea : int 8450 9600 9550 14260 14115 10382 6120 7420 11200 12968 ...
$ Street : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
$ Alley : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
$ LotShape : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 4 4 2 ...
$ LandContour : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
$ Utilities : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
$ LotConfig : Factor w/ 5 levels "Corner","CulDSac",..: 5 3 1 3 5 1 5 1 5 5 ...
$ LandSlope : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
$ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 7 14 12 17 18 4 19 19 ...
$ Condition1 : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 5 1 1 3 3 ...
$ Condition2 : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 1 3 3 ...
$ BldgType : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 2 1 1 ...
$ HouseStyle : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 1 6 1 2 3 3 ...
$ OverallQual : int 7 6 7 8 5 7 7 5 5 5 ...
$ OverallCond : int 5 8 5 5 5 6 5 6 5 6 ...
$ YearBuilt : int 2003 1976 1915 2000 1993 1973 1931 1939 1965 1962 ...
$ YearRemodAdd : int 2003 1976 1970 2000 1995 1973 1950 1950 1965 1962 ...
$ RoofStyle : Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 4 4 ...
$ RoofMatl : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Exterior1st : Factor w/ 15 levels "AsbShng","AsphShn",..: 13 9 14 13 13 7 4 9 7 7 ...
$ Exterior2nd : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 16 14 14 7 16 9 7 11 ...
$ MasVnrType : Factor w/ 4 levels "BrkCmn","BrkFace",..: 2 3 3 2 3 4 3 3 3 3 ...
$ MasVnrArea : num 196 0 0 350 0 240 0 0 0 0 ...
$ ExterQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 4 3 4 4 4 4 4 4 ...
$ ExterCond : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
$ Foundation : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 1 3 6 2 1 1 2 2 ...
$ BsmtQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 3 4 3 3 3 4 4 4 4 ...
$ BsmtCond : Factor w/ 4 levels "Fa","Gd","Po",..: 4 4 2 4 4 4 4 4 4 4 ...
$ BsmtExposure : Factor w/ 4 levels "Av","Gd","Mn",..: 4 2 4 1 4 3 4 4 4 4 ...
$ BsmtFinType1 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 3 1 1 3 3 1 6 3 5 1 ...
$ BsmtFinSF1 : int 706 978 216 655 732 859 0 851 906 737 ...
$ BsmtFinType2 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 6 6 6 6 6 2 6 6 6 6 ...
$ BsmtFinSF2 : int 0 0 0 0 0 32 0 0 0 0 ...
$ BsmtUnfSF : int 150 284 540 490 64 216 952 140 134 175 ...
$ TotalBsmtSF : int 856 1262 756 1145 796 1107 952 991 1040 912 ...
$ Heating : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
$ HeatingQC : Factor w/ 5 levels "Ex","Fa","Gd",..: 1 1 3 1 1 1 3 1 1 5 ...
$ CentralAir : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
$ Electrical : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 2 5 5 5 ...
$ X1stFlrSF : int 856 1262 961 1145 796 1107 1022 1077 1040 912 ...
$ X2ndFlrSF : int 854 0 756 1053 566 983 752 0 0 0 ...
$ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
$ GrLivArea : int 1710 1262 1717 2198 1362 2090 1774 1077 1040 912 ...
$ BsmtFullBath : int 1 0 1 1 1 1 0 1 1 1 ...
$ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
$ FullBath : int 2 2 1 2 1 2 2 1 1 1 ...
$ HalfBath : int 1 0 0 1 1 1 0 0 0 0 ...
$ BedroomAbvGr : int 3 3 3 4 1 3 2 2 3 2 ...
$ KitchenAbvGr : int 1 1 1 1 1 1 2 2 1 1 ...
$ KitchenQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 3 4 4 4 4 4 4 ...
$ TotRmsAbvGrd : int 8 6 7 9 5 7 8 5 5 4 ...
$ Functional : Factor w/ 7 levels "Maj1","Maj2",..: 7 7 7 7 7 7 3 7 7 7 ...
$ Fireplaces : int 0 1 1 1 0 2 2 2 0 0 ...
$ FireplaceQu : Factor w/ 5 levels "Ex","Fa","Gd",..: NA 5 3 5 NA 5 5 5 NA NA ...
$ GarageType : Factor w/ 6 levels "2Types","Attchd",..: 2 2 6 2 2 2 6 2 6 6 ...
$ GarageYrBlt : int 2003 1976 1998 2000 1993 1973 1931 1939 1965 1962 ...
$ GarageFinish : Factor w/ 3 levels "Fin","RFn","Unf": 2 2 3 2 3 2 3 2 3 3 ...
$ GarageCars : int 2 2 3 3 2 2 2 1 1 1 ...
$ GarageArea : int 548 460 642 836 480 484 468 205 384 352 ...
$ GarageQual : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 2 3 5 5 ...
$ GarageCond : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
$ PavedDrive : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 3 3 3 3 3 ...
$ WoodDeckSF : int 0 298 0 192 40 235 90 0 0 140 ...
$ OpenPorchSF : int 61 0 35 84 30 204 0 4 0 0 ...
$ EnclosedPorch: int 0 0 272 0 0 228 205 0 0 0 ...
$ X3SsnPorch : int 0 0 0 0 320 0 0 0 0 0 ...
$ ScreenPorch : int 0 0 0 0 0 0 0 0 0 176 ...
$ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
$ PoolQC : Factor w/ 3 levels "Ex","Fa","Gd": NA NA NA NA NA NA NA NA NA NA ...
$ Fence : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA 3 NA NA NA NA NA ...
$ MiscFeature : Factor w/ 4 levels "Gar2","Othr",..: NA NA NA NA 3 3 NA NA NA NA ...
$ MiscVal : int 0 0 0 0 700 350 0 0 0 0 ...
$ MoSold : int 2 5 2 12 10 11 4 1 2 9 ...
$ YrSold : int 2008 2007 2006 2008 2009 2009 2008 2008 2008 2008 ...
$ SaleType : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
$ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 1 5 5 5 1 5 5 5 ...
$ SalePrice : int 208500 181500 140000 250000 143000 200000 129900 118000 129500 144000 ...
$ ppsqf : num 121.9 143.8 81.5 113.7 105 ...
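The partition above is random, so the selected rows, and every number that follows, change between runs. A minimal base R sketch of a repeatable 75/25 split; the seed value 42 and the toy data frame are arbitrary assumptions for illustration. Unlike createDataPartition, this simple version does not balance the split over the range of the target. Calling set.seed() before createDataPartition() should make the caret split repeatable in the same way.

```r
# Toy data frame standing in for the housing data (values are made up).
toy <- data.frame(Id = 1:20, SalePrice = seq(100000, 290000, by = 10000))

set.seed(42)  # arbitrary seed; any fixed value makes the split repeatable
train_idx <- sample(nrow(toy), size = floor(0.75 * nrow(toy)))

train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

# The two parts are disjoint and together cover every row.
print(c(train = nrow(train_toy), test = nrow(test_toy)))
```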
After creating the partitions, let’s build the model.
First let’s just use one parameter:
model <- lm(SalePrice~OverallQual, data = traindata)
summary(model)
Call:
lm(formula = SalePrice ~ OverallQual, data = traindata)
Residuals:
Min 1Q Median 3Q Max
-176057 -29307 -1932 20693 394193
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -97943 6761 -14.49 <2e-16 ***
OverallQual 45875 1085 42.26 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 49240 on 1095 degrees of freedom
Multiple R-squared: 0.62, Adjusted R-squared: 0.6196
F-statistic: 1786 on 1 and 1095 DF, p-value: < 2.2e-16
plot(model)
shapiro.test(model$residuals)
Shapiro-Wilk normality test
data: model$residuals
W = 0.88868, p-value < 2.2e-16
confint(model)
2.5 % 97.5 %
(Intercept) -111209.66 -84676.94
OverallQual 43745.25 48004.80
cor(traindata$SalePrice, model$fitted.values)
[1] 0.7873731
model
Call:
lm(formula = SalePrice ~ OverallQual, data = traindata)
Coefficients:
(Intercept) OverallQual
-97943 45875
prediction <- predict(model, testdata, type="response")
model_output <- cbind(testdata, prediction)
model_output$log_prediction <- log(model_output$prediction)
NaNs produced
model_output$log_SalePrice <- log(model_output$SalePrice)
rmse <- function(fittedvals, truevals){
  sqrt(mean((fittedvals - truevals)^2))
}
rmse(model_output$log_SalePrice,model_output$log_prediction)
[1] NaN
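The fitted line can be checked by hand for the lowest quality grades, using the rounded coefficients printed above. The helper name `predict_price` is hypothetical and only serves this illustration:

```r
# Rounded coefficients copied from the model output above.
intercept <- -97943
slope     <- 45875

# Hypothetical helper: the fitted line of the one-parameter model.
predict_price <- function(overall_qual) intercept + slope * overall_qual

quality <- 1:4
pred <- predict_price(quality)
print(pred)  # -52068 -6193 39682 85557

# log() of a negative number is NaN (with a warning), and mean() of a
# vector that contains NaN is NaN, so the RMSE cannot be computed.
log_pred <- suppressWarnings(log(pred))
print(is.nan(log_pred))  # TRUE TRUE FALSE FALSE
```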
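As we can see, this model cannot even be scored: it predicts negative prices for the lowest OverallQual values, so the RMSE of the logarithms is undefined. Let’s try a different approach, with more parameters: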
model2 <- lm(SalePrice~OverallQual+GrLivArea, data = traindata)
summary(model2)
Call:
lm(formula = SalePrice ~ OverallQual + GrLivArea, data = traindata)
Residuals:
Min 1Q Median 3Q Max
-312839 -23114 -499 20362 284307
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -108567.53 5883.22 -18.45 <2e-16 ***
OverallQual 32823.44 1162.10 28.25 <2e-16 ***
GrLivArea 59.44 3.11 19.11 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 42650 on 1094 degrees of freedom
Multiple R-squared: 0.7151, Adjusted R-squared: 0.7146
F-statistic: 1373 on 2 and 1094 DF, p-value: < 2.2e-16
Let’s evaluate it:
plot(model2)
shapiro.test(model2$residuals)
Shapiro-Wilk normality test
data: model2$residuals
W = 0.90193, p-value < 2.2e-16
confint(model2)
2.5 % 97.5 %
(Intercept) -120111.19159 -97023.865
OverallQual 30543.23576 35103.644
GrLivArea 53.33354 65.538
cor(traindata$SalePrice, model2$fitted.values)
[1] 0.8456234
model2
Call:
lm(formula = SalePrice ~ OverallQual + GrLivArea, data = traindata)
Coefficients:
(Intercept) OverallQual GrLivArea
-108567.53 32823.44 59.44
prediction2 <- predict(model2, testdata, type="response")
model2_output <- cbind(testdata, prediction2)
model2_output$log_prediction <- log(model2_output$prediction2)
NaNs produced
model2_output$log_SalePrice <- log(model2_output$SalePrice)
rmse(model2_output$log_SalePrice, model2_output$log_prediction)
[1] NaN
The fit on the training data actually improved (R-squared rose from 0.62 to 0.7151), but some predictions are still negative, so the RMSE is still NaN. Anyway, let’s try giving the model more parameters:
model3 <- lm(SalePrice~OverallQual+YearBuilt+YearRemodAdd+TotalBsmtSF+X1stFlrSF+
GrLivArea+FullBath+TotRmsAbvGrd+GarageCars+GarageArea+ppsqf, data = traindata)
summary(model3)
Call:
lm(formula = SalePrice ~ OverallQual + YearBuilt + YearRemodAdd +
TotalBsmtSF + X1stFlrSF + GrLivArea + FullBath + TotRmsAbvGrd +
GarageCars + GarageArea + ppsqf, data = traindata)
Residuals:
Min 1Q Median 3Q Max
-250178 -5733 187 5790 143316
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.302e+04 7.824e+04 -0.678 0.49816
OverallQual 2.007e+03 7.503e+02 2.675 0.00760 **
YearBuilt 3.202e+01 3.073e+01 1.042 0.29769
YearRemodAdd -1.102e+02 3.780e+01 -2.917 0.00361 **
TotalBsmtSF 6.385e-01 2.702e+00 0.236 0.81323
X1stFlrSF 1.725e+00 2.983e+00 0.578 0.56314
GrLivArea 1.187e+02 2.733e+00 43.417 < 2e-16 ***
FullBath -5.862e+02 1.581e+03 -0.371 0.71091
TotRmsAbvGrd -8.881e+01 6.594e+02 -0.135 0.89288
GarageCars 2.261e+01 1.807e+03 0.013 0.99002
GarageArea 1.242e+00 6.070e+00 0.205 0.83794
ppsqf 1.625e+03 3.068e+01 52.969 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 19020 on 1085 degrees of freedom
Multiple R-squared: 0.9438, Adjusted R-squared: 0.9432
F-statistic: 1656 on 11 and 1085 DF, p-value: < 2.2e-16
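Linear models of the raw price are clearly not the way to go: they can predict negative prices, and the residuals of the first two are far from normal. Let’s try some exponential ones, that is, linear models of log(SalePrice).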
model4 <- lm(log(SalePrice) ~ OverallQual, data = traindata)
summary(model4)
Call:
lm(formula = log(SalePrice) ~ OverallQual, data = traindata)
Residuals:
Min 1Q Median 3Q Max
-1.07183 -0.12568 0.01076 0.12451 0.71575
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.584273 0.031367 337.44 <2e-16 ***
OverallQual 0.236949 0.005036 47.05 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2284 on 1095 degrees of freedom
Multiple R-squared: 0.6691, Adjusted R-squared: 0.6688
F-statistic: 2214 on 1 and 1095 DF, p-value: < 2.2e-16
plot(model4)
shapiro.test(model4$residuals)
Shapiro-Wilk normality test
data: model4$residuals
W = 0.98069, p-value = 6.697e-11
confint(model4)
2.5 % 97.5 %
(Intercept) 10.5227273 10.6458192
OverallQual 0.2270685 0.2468296
cor(traindata$SalePrice, model4$fitted.values)
[1] 0.7873731
model4
Call:
lm(formula = log(SalePrice) ~ OverallQual, data = traindata)
Coefficients:
(Intercept) OverallQual
10.5843 0.2369
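As we can see, this model behaves better than our previous attempts: its residuals are much closer to normal (W = 0.98069), and its predictions, once transformed back with exp(), can never be negative. Let’s try the same thing with more parameters.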
model5 <- lm(log(SalePrice)~OverallQual+YearBuilt+YearRemodAdd+TotalBsmtSF+X1stFlrSF+
GrLivArea+FullBath+TotRmsAbvGrd+GarageCars+GarageArea+ppsqf, data = traindata)
summary(model5)
Call:
lm(formula = log(SalePrice) ~ OverallQual + YearBuilt + YearRemodAdd +
TotalBsmtSF + X1stFlrSF + GrLivArea + FullBath + TotRmsAbvGrd +
GarageCars + GarageArea + ppsqf, data = traindata)
Residuals:
Min 1Q Median 3Q Max
-1.09770 -0.03747 0.01641 0.05551 0.17188
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.553e+00 3.835e-01 19.692 < 2e-16 ***
OverallQual 2.396e-02 3.678e-03 6.514 1.12e-10 ***
YearBuilt 8.148e-04 1.506e-04 5.409 7.78e-08 ***
YearRemodAdd 5.140e-04 1.853e-04 2.774 0.005630 **
TotalBsmtSF 1.249e-05 1.324e-05 0.943 0.345966
X1stFlrSF -4.762e-06 1.462e-05 -0.326 0.744755
GrLivArea 4.812e-04 1.340e-05 35.909 < 2e-16 ***
FullBath 5.133e-03 7.751e-03 0.662 0.507918
TotRmsAbvGrd 1.210e-02 3.232e-03 3.742 0.000192 ***
GarageCars 3.333e-02 8.856e-03 3.764 0.000176 ***
GarageArea -3.522e-05 2.975e-05 -1.184 0.236740
ppsqf 6.915e-03 1.504e-04 45.979 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.09325 on 1085 degrees of freedom
Multiple R-squared: 0.9454, Adjusted R-squared: 0.9448
F-statistic: 1706 on 11 and 1085 DF, p-value: < 2.2e-16
plot(model5)
shapiro.test(model5$residuals)
Shapiro-Wilk normality test
data: model5$residuals
W = 0.82302, p-value < 2.2e-16
confint(model5)
2.5 % 97.5 %
(Intercept) 6.800319e+00 8.305464e+00
OverallQual 1.674089e-02 3.117387e-02
YearBuilt 5.192312e-04 1.110337e-03
YearRemodAdd 1.504410e-04 8.775254e-04
TotalBsmtSF -1.349985e-05 3.847460e-05
X1stFlrSF -3.345656e-05 2.393213e-05
GrLivArea 4.548603e-04 5.074428e-04
FullBath -1.007478e-02 2.034143e-02
TotRmsAbvGrd 5.754194e-03 1.843926e-02
GarageCars 1.595416e-02 5.070691e-02
GarageArea -9.360557e-05 2.315829e-05
ppsqf 6.620080e-03 7.210296e-03
cor(traindata$SalePrice, model5$fitted.values)
[1] 0.9653806
model5
Call:
lm(formula = log(SalePrice) ~ OverallQual + YearBuilt + YearRemodAdd +
TotalBsmtSF + X1stFlrSF + GrLivArea + FullBath + TotRmsAbvGrd +
GarageCars + GarageArea + ppsqf, data = traindata)
Coefficients:
(Intercept) OverallQual YearBuilt YearRemodAdd TotalBsmtSF X1stFlrSF GrLivArea FullBath
7.553e+00 2.396e-02 8.148e-04 5.140e-04 1.249e-05 -4.762e-06 4.812e-04 5.133e-03
TotRmsAbvGrd GarageCars GarageArea ppsqf
1.210e-02 3.333e-02 -3.522e-05 6.915e-03
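As we can see, our model did not really improve; on the contrary, it worsened. The residuals are further from normal (W = 0.82302 versus 0.98069), and the high R-squared of 0.9454 is misleading, because ppsqf is computed from SalePrice itself.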
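Neither of the two log(SalePrice) models was scored on testdata. A minimal sketch of that step on synthetic data; the data-generating formula, the seed, and the names `toy_train` and `toy_test` are assumptions for illustration. Because the response is log(SalePrice), predict() already returns log prices, so they can be compared with log(SalePrice) directly.

```r
set.seed(1)  # arbitrary seed for the synthetic example

# Synthetic data: price grows exponentially with quality, plus noise.
n <- 200
toy <- data.frame(OverallQual = sample(1:10, n, replace = TRUE))
toy$SalePrice <- exp(10.5 + 0.24 * toy$OverallQual + rnorm(n, sd = 0.2))

toy_train <- toy[1:150, ]
toy_test  <- toy[151:200, ]

rmse <- function(fittedvals, truevals) {
  sqrt(mean((fittedvals - truevals)^2))
}

fit <- lm(log(SalePrice) ~ OverallQual, data = toy_train)

# Predictions are on the log scale because the response was log(SalePrice).
log_pred <- predict(fit, newdata = toy_test)
log_rmse <- rmse(log_pred, log(toy_test$SalePrice))
print(log_rmse)

# Back-transformed predictions are prices, and exp() is never negative.
price_pred <- exp(log_pred)
print(all(price_pred > 0))
```

Scoring every candidate model this way on the same held-out rows would give a single number to compare them by.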
I would love to continue working on this exercise, improving my models and examining the target parameter in relation to groups of objects, such as city areas or types of buildings, but sadly I’m out of time.
As a last thing, I tried to check whether the prices and the sizes of living areas are affected by the neighborhood. We can only see some correlation in the outliers.
plot(data$GrLivArea, data$SalePrice, col=data$Neighborhood)
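A scatterplot coloured by 25 neighborhoods is hard to read. One possible alternative is to summarise the price per neighborhood directly. The sketch below uses a handful of made-up rows; the neighborhood labels and their pairing with prices are assumptions for illustration only:

```r
# Made-up rows for illustration; labels and prices are not real pairings.
toy <- data.frame(
  Neighborhood = c("CollgCr", "CollgCr", "NoRidge", "NoRidge", "OldTown", "OldTown"),
  SalePrice    = c(208500, 223500, 250000, 307000, 129900, 118000)
)

# Median sale price per neighborhood, most expensive first.
by_hood <- aggregate(SalePrice ~ Neighborhood, data = toy, FUN = median)
by_hood <- by_hood[order(-by_hood$SalePrice), ]
print(by_hood)
```

On the real data, boxplot(SalePrice ~ Neighborhood, data = data) would additionally show the spread within each group.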
In conclusion, I still have a lot to learn and would need a lot more time to solve the problem properly. My current best model uses only a single parameter. If I have the time in the future, I’ll return and try to solve it properly.